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ABSTRACT 

Many astrophysical sources, especially compact accreting sources, show strong, random 
brightness fluctuations with broad power spectra in addition to periodic or quasi-periodic 
oscillations (QPOs) that have narrower spectra. The random nature of the dominant source of 
variance greatly complicates the process of searching for possible weak periodic signals. We 
have addressed this problem using the tools of Bayesian statistics; in particular using Markov 
chain Monte Carlo techniques to approximate the posterior distribution of model parame- 
ters, and posterior predictive model checking to assess model fits and search for periodogram 
outliers that may represent periodic signals. The methods developed are applied to two ex- 
ample datasets, both long XMM-Newton observations of highly variable Seyfert 1 galaxies: 
RE J 1034 + 396 and Mrk 766. In both cases a bend (or break) in the power spectrum is evi- 
dent. In the case of RE J 1034 + 396 the previously reported QPO is found but with somewhat 
weaker statistical significance than reported in previous analyses. The difference is due partly 
to the improved continuum modelling, better treatment of nuisance parameters, and partly to 
different data selection methods. 
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1 INTRODUCTION 

A perennial problem in observational astrophysics is detecting pe- 
riodic or almost-periodic signals in noisy time series. The stan- 
dard analysis tool is the periodogram (see e.g. Jenkins & Watts 
1969l:|Priestle\ll98ll;|Press et alll992l;lBloomfield 200d: IChatfieid 



2003), and the problem of period detection amounts to assessing 
whether or not some particular peak in the periodogram is due to a 
peri odic component or a random fluctuation in the noise spectrum 
(see |Fisherlll929l; |Priestlevlll98ll: iLeahv et alj[l983l : Ivan der Klisl 
ll989l : |Percival & Waldenlll993l : lBloomfieldll200d) . 

If the time series is the sum of a random (stochastic) compo- 
nent and a periodic one we may write y(t) = yn{t) + yp(t) and, 
due to the independence of yn(t) and yp(t), the power spectrum of 
y(t) is the sum of the two power spectra of the random and stochas- 
tic processes: Syif) = S n(f) + Sp(f). This is a mixed spectrum 
dPercival & Waldenl 1 19931 , section 4.4) formed from the sum of 
Sp(f), which comprises only narrow features, and Sp(f), which 
is a continuous, broad spectral function. Likewise, we may con- 
sider an evenly sampled, finite time series y(ti) (i — 1,2, ... , N) 
as the sum of two finite time series: one is a realisation of the pe- 
riodic process, the other a random realisation of the stochastic pro- 
cess. We may compute the periodogram (which is an estimator of 
the true power spectrum) from the squared modulus of the Dis- 
crete Fourier Transform (DFT) of the time series, and, as with the 
power spectra, the periodograms of the two processes add linearly: 
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I(fj) = In(fj) + Ip(fj)- The periodogram of the periodic time 
series will contain only narrow "lines" with all the power concen- 
trated in only a few frequencies, whereas the periodogram of the 
stochastic time series will show power spread over many frequen- 
cies. Unfortunately the periodogram of stochastic processes fluc- 
tuates wildly around the true power spectrum, making it difficult 
to distinguish random fluctuations i n the noise spectrum from truly 
spectral periodic components. See Ivan derKlis (1989) for a thor- 
ough review of these issues in the context of X-ray astronomy. 

Particular attention has been given to the special case that 
the spectrum of the stochastic process is flat (a white noise spec- 
trum S(f) = const), which is the case when the time series data 
yn(ti) are independently and identically distributed (IID) random 
variables. Reasonably well-established statistical procedures have 
been developed to help identify spuri ous spectral peaks and reduce 
the chance of false detections (e.g. [fisher 1929; Priestlevl l 198ll : 



ILeahv et all 19831 : Ivan der Klislll989l : |Percival & Waldenlll993l) . In 
contrast there is no comparably well-established procedure in the 
general case that the spectrum of the stochastic process is not flat. 

In a previous paper, IVaugharj (2005) (henceforth |V05|) . we 
proposed what is essentially a generalisation of Fisher's method to 
the case where the noise spectrum is a power law: Sp(f) — /3f~ a 
(where a and /3 are the power law index and normalisation param- 
eters). Processes with power spectra that show a power law depen- 
dence on frequency with a > (i.e. increasing power to lower 
frequencies) are called red n oise and ar e extremely common in as- 
tronomy and els ewhe re (see |Presslll978l) . In this paper we expand 
upon the ideas in V05 and, in particular, address the problem from a 
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Bayesian perspective that allows further generalisation of the spec- 
tral model of the noise. 

The rest of this paper is organised as follows. In section [2] 
we introduce some of the basic concepts of the Bayesian approach 
to statistical inference; readers familiar with this topic may prefer 
to skip this section. Section [3] gives a brief overview of classical 
significance testing using p- values (tail area probabilities) and test 
statistics, and section [4] discusses the posterior predictive p- value, 
a Bayesian counterpart to the classical p-value. Section [5] reviews 
the conventional (classical) approaches to testing for periodogram 
peaks. Section|6]outlines the theory of maximum likelihood estima- 
tion from periodogram data, which is developed into the basis of a 
fully Bayesian analysis in sections [7] and [8] The Bayesian method 
is then applied to two real observations if AGN in section [9] Sec- 
tion[lO]discusses the limitations of the method, and alternative ap- 
proaches to practical data analysis. A few conclusions are given in 
section [TT] and two appendices describe details of the simulations 
algorithms used in the analysis. 



2 BAYESIAN BASICS, BRIEFLY 

There are two main tasks in statistical inference: parameter esti- 
mation and model checking (or comparison). Bayesian parameter 
estimation is concerned with finding the probability of the param- 
eters given the model p(6\x, H), where x (= {x\, . . . , xm}) are 
data values , (= {0\, . . . ,9m}) are parameter values and H rep- 
resents the model. In contrast, frequentist (or classical) statistics 
restricts attention to the sampling distribution of the data given the 
model and parameters p(x\G, H). These two probability functions 
are related by Bayes' Theorem 



p(0\x,H) = 



p(x\0,H)p(0\H) 
p(x\H) 



(1) 



Each of the terms in Bayes' theorem has a name when used in 
Bayesian data analysis: p(6\x, H) is the posterior distribution of 
the parameters; p(x\6,H) is the likelihood function of the pa- 
rameter^ p{6\H) is the prior distribution of the parameters, and 
p(x\H) is a normalising constant sometimes referred to as the 
marginal likelihood (of the data) or the prior predictive distri- 
bution^ General introductions to Bayesi a n analysis for the non- 
spe cialist include Jeffreys & Berger ( 1992), Berg er & Berrvi 1 1988) 
and Howson & Urbact J d 199 II) : more t horough treatments i nclude 
[ Berry! jl 9961) ICarlin & Louis! fcOOOl) . iGelman et all d2004l) . and 
I Led ( 120041) : and discussions more focuss e d on phy s ics an d as- 
trophys i cs problems include ISivial dl996t) . iGregorvl d2005l) . and 
lLoredd < fl990l . 1 1 992t) . 

In Bayesian analysis the posterior distribution is a complete 
summary of our inference about the parameters given the data x, 
model H, and any prior information. But this can be further sum- 
marised using a point estimate for the parameters such as the mean, 
median or mode of the posterior distribution. For one parameter the 
posterior mean is 



E[6\x,H] 



I 



x,H)d9 



(2) 



1 Note that when considered as a function of the data for known parame- 
ters, p(x\0, H) is the sampling distribution of the data, but when consid- 
ered as a function of the parameters for fixed data, p(x\9, H) is known as 

the likelihood, sometimes denoted L(0). 

Som e physicists use the evidence for this term, e.g. ISivial jl996l) , lTrottj 
<2008t) . 



A slightly more informative summary is a credible interval (or 
credible region for multiple parameters). This is an interval in pa- 
rameter space that contain a specified probability mass (e.g. 90 per 
cent) of the posterior distribution. These intervals give an indication 
of the uncertainty on the point inferences. 



p{6\x,H)d6 = G, 



(3) 



where C is the probability content (e.g. C = 0.9) and R is interval 
in parameter space. One common approach is to select the interval 
satisfying equation [3] that contains the highest (posterior) density 
(i.e. the posterior density at any point inside is higher than at any 
point outside). This will give the smallest interval that contains a 
probability C, usually called the highest posterior density region 
(abbreviated to HDR or HPD interval by different authors). An al- 
ternative is the equal tail posterior interval, which is defined by the 
two values above and below which is (1 — C)/2 of th e posterior 
probab ility. These two types of interval are illustrated in lPark et al.l 
(120081, see their Fig. 1). 

If we have multiple parameters but are interested in only one 
parameter we may marginalize over the other parameters. For ex- 
ample, if 6 — {6i,02} then the posterior distribution for 0\ is 



p(9 1 \x,H) 



P (9 1 ,02\x,H)d9 2 

P {9 1 \02,X,H)p(02\x,H)d02 



(4) 



This is the average of the joint posterior p(0i, 02\x, H) over 02- 
In the second formulation the joint posterior has been factored into 
two distributions, the first is the conditional posterior of 0\ given 
02 and the second is the posterior density for 02. 

Most present day Bayesian analysis is carried out with the aid 
of Monte Carlo methods for evaluating the necessary integrals. In 
particular, if we have a method for simulating a random sample of 
size N from the posterior distribution p(9\x, H) then the posterior 
density may be approximated by a histogram of the random draws. 
This gives essentially complete information about the posterior (for 
a sufficiently large iV). The posterior mean may be approximated 
by the sample mean 



E[e\x,H]*±J2 ei 



(5) 



where % are the individual simulations from the posterior. If the 
parameter is a vector = {0i, . . . , 0m}, the mth component of 
each vector is a sample from the marginal distribution of the mth 
parameter. This means the posterior mean of the each parameter is 
approximated by the sample mean of each component of the vec- 
tor. Intervals may be calculated from the sample quantiles, e.g. the 
90 per cent equal tail area interval on a parameter may be approx- 
imated by the interval between the 0.05 and 0.95 quantiles of the 
sample. In this manner the difficult (sometimes insoluble) integrals 
of equations [2] [3] and [4] may be replaced by trivial operations on 
the random sample. The accuracy of these approximations is gov- 
erned by the accuracy with which the distribution of the simulations 
matches the posterior density, and the size of the random sample N. 
Much of the work on practical Bayesian data analysis methods has 
been devoted to the generation and assessment of accurate Monte 
Carlo methods, particularly the use of Markov chain Monte Carlo 
(MCMC) methods, which will be discussed and used later in this 
paper. 

For model comparison we may again use Bayes theorem to 
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Table 1. Definitions used throughout the paper. 



Term Definition 



fj 


The jth Fourier frequency fj = j/NAT (j = 1, . . . , N/2) 


Ij 


Periodogram at frequency fj 


I 


vector of periodogram values I = {Ji, . . . , /jv/2} 





Model parameters = [9%, . . . , 8^} 


Omle 


Maximum Likelihood Estimates of parameters (equationllSl 


X 


Data (e.g. time series) x = {xi, . . . , Xff} 


PC 


Frequentist/classical (conditional) p-value (equation [To) 


PB 


Bayesian (posterior predictive) p-value (equationll2) 


Sj 


Model spectral density at frequency fj, i.e. S(fj; 0) 


s 3 


The model computed at the estimate (equationll5t 


D(x,0) 


Deviance (—2 log p(x \0)) given model H (equationll7t 


p(x\0,H) 


Likelihood for parameters of model H given data x (equation[T} 


p(0\H) 


Prior probability density for parameters (equation [TJ 


p(0\x,H) 


Posterior probability density for parameters given data x (equation[TJ 


p(x\H) 


Prior predictive density (aka marginal likelihood) of the data x (equation[TJ 


£C rC P 


Replicated data (from repeat observations or simulations) (equation 11 It 


p(cC r °P|£C obs ) 


Posterior predictive distribution given data x ohB (equationllll 


T(x) 


A test statistic 



give the posterior probability for model Hi 

p(x\H z )p(H t ) 



p(Hi\x) 



p{x) 



(6) 



and then compare the posterior probabilities for two (or more) com- 
peting models, say Ho and Hi (with parameters Go and 61, respec- 
tively). (In effect we are treating the choice of model, i/j, as a dis- 
crete parameter.) The ratio of these two eliminates the term in the 
denominator (which has no dependence on model selection): 

p(£Ti|aO _p{x\H x )p{Hi) 



O = 



(7) 



p(H \x) p(x\H )p(H ) 

The first term on the right hand side of equation [7] is the ratio of 
likeli hoods and is often called the Bayes factor (see lKass & Raftervl 
Il995l) and the second term is the ratio of the priors. However, in 
order to obtain p(Hi\x) we must first remove the dependence of 
the posterior distributions on their parameters, often called nui- 
sance parameters in this context (we are not interested in making 
inferences about Gi, but they are necessary in order to compute 
the model). In order to do this the full likelihood function must be 
integrated or marginalized over the joint prior probability density 
function (PDF) of the parameters: 



(8) 



p(x\Hi) = J p(x\0i,Hi)p(Oi\Hi)dOi 

Here, p(x\8i, Hi) is the likelihood and p(6i\Hi) the prior for the 
parameters of model Hi . 



3 TEST STATISTICS AND SIGNIFICANCE TESTING 

We return briefly to the realm of frequentist statistics and consider 
the idea of significance testing using a test statistic. A test statistic 
T(x) is a real-valued function of the data chosen such that extreme 
values are unlikely when the null hypothesis Hq is true. If the sam- 
pling distribution of T is p(T\Ho), under the null hypothesis, and 
the observed value is T° hs = T(x° hB ), then the classical p-value 
is 



pc(x 



p(T\H )dT = Pi{T(x" 



> T(x° hB )\Ho}, 



(9) 

where Pra% is the probability of event x given that event y oc- 
cured. The second formulation is in terms of replicated data that 
could have been observed, or could be observ ed in repeat experi- 
ments jMendll994l : lGelman et al.lll996tl2004) . The p-value gives 
the fraction of p(T\Ho) lying above the observed value T obs . As 
such, p- values are tail area probabilities, and one usually uses small 
pc as evidence against the null hypothesis. If the null hypothesis 
is simple, i.e. has no free parameters, or the sampling distribution 
of T is independent of any free parameters, then the test statistic is 
said to be pivotal. If the distribution of the test statistic does depend 
on the parameters of the model, i.e p(T\G, Ho), as is often the case, 
then we have a conditional p-value 

/• + oo 

pc(x ohs ,0) = / p{T\8)dT = Pr{T(:z rcp ) > T(a; obs )|0} 

(10) 

(For clarity we have omitted the explicit conditioning on Ho-) In 
order to compute this we must have an estimate for the nuisance 
parameters 6. 



4 POSTERIOR PREDICTIVE P- VALUES 



In Bayesian analysis the posterior predictive distribution is the dis- 
tribution of x rcp given the available information which includes 
a; obs and any prior information. 



/ rep 1 obs\ 

p(x \X ) 



p{x iep \e)p(e\x obB )do, 



(11) 



(e.g. section 6.3 of lGelman et al.l2 004). Here, p(0\x° hs ) is the pos- 
terior distribution of the parameters (see eqn. QJ and p(x rcp \G) is 
the sampling distribution of the data given the parameters. The 
Bayesian p-value is the (tail area) probability that replicated data 
could give a test statistic at least as extreme as that observed. 



Pb{x) 



p C {x ohs ,o)p{e\x ohs )do 



Pr{T(x Tep ) > T(x obs )\x ohs ,H } 



(12) 
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This is just the classical p-value (eqn. UOt averaged over the 
posterior distribution of 6 (eqn. Q3, i.e. the posterior mean 
Fi[pc\x° hB , Ho] which may be calculated using simulations. In 
other words, it gives the average of the conditional p-values eval- 
uated at over the range of parameter values, weighted by the (pos- 
terior) probability of the parameter values. The aim of the poste- 
rior predictive p- value (or, more generally, comparing the observed 
value of a test statistic to its posterior predictive distribution) is to 
provide a simple assessment of whether the data are similar (in im- 
portant ways) to the data expected under a particular model. 

This tail area probability does not depend on the unknown 
value of par ameters 0, and is often called the posterior predictive 
p- value (seelRubinlll984 lMendll994l ; IGelman et al] 1 19961 |2004 
IProtassov et alJt2002l) . The (classical) conditional p-value and the 
(Bayesian) posterior predictive p-value are in general different but 
are equivalent in two special cases. If the null hypothesis is sim- 
ple or the test statistic T is pivotal, then the sampling and posterior 
predictive distributions of T are the same, pc = ps ■ 

Like the classical (conditional) p-value, p_g is used for model 
checking but has the advantage of having no dependence on un- 
known parameters. The posterior predictive distribution of T in- 
cludes the uncertainty in th e classical p -value pc due to the un- 
known nuisance parameters dMendll994l) . 

The posterior predictive p-value is a single summary of the 
agreement between data and model, and may be used to assess 
whether the data are consistent with being drawn from the model: 
a p-value that is not extreme (i.e. not very close to or 1) shows 
the observed value r oba is not an outlier in the population T Tep . 
IGelman et all J2004l) and lProtassov et al] {2002) argue that model 
checking based on the posterior predictive distribution is less sen- 
sitive to the choice of priors (on the parameters), and more useful 
in identifying deficiencies in the model, compared to Bayes factors 
or posterior odds (eqn. [7). 



5 CONDITIONAL SIGNIFICANCE OF PERIODOGRAM 
PEAKS 

We now return to the problem of assessing the significance of peaks 
in periodograms of noisy time series. The null hypothesis, Ho, 
in this case is that the time series was the product of a stochas- 
tic process. It is well known that the periodogram of any stochas- 
tic time series of length N, denoted Ij — I(fj) at Fourier fre- 
quency fj = j/NAT (with j = 1, . . . ,N/2), is exponentially 
distributee^] about the true spectral density Sj = S(fj) 



P(Ji\Si 



(13) 



(see Ijenkins & Wattsl [l969l: iGrotj Il975l; IPriestleyl 1 198ll : 
Leahy et al.l 1 19831 : Ivan der Klisl 1 19891; iPress et al.l Il992l: 



Percival & Walden 1993; Timmer & Konigj 1 19951 : [Bloomfield 
2000; Chatfield 2003). Strictly speaking this is valid for the Fourier 
frequencies other than the zero and Nyqist frequency (j = 1 and 
j — N/2), which follow a different distribution, although in the 
limit of large TV this difference is almost always inconsequential. 
This distribution means that the ratio of the periodogram ordinates 

3 The exponential distribution p(x\\) = \e~^ x is a special case of the 
chi square distribution \i with v = 2 degrees of fre edom, and a specia l 
case of the gamma d i stribution, 1/A ). See e.g. [ Eadie et al. il97l|) . 
ICarlin & Louis] feOOOh . lGelman et alj <2004l) or lLee] <2004l) for more on spe- 
cific distribution functions. 



Ij to the true spectrum Sj will be identically distributed. If we 
have a parametric spectral model with known parameters, Sj(0), 
the ratio 



(14) 



R° ha = 2I° ha /Sj(0) 

will be distributed as Xv with v = 2 degrees of freedom (see lV05l) 
and it is trivial to integrate this density function to find the classi- 
cal tail area p-value corresponding to a given observed datum I° bs 
This simple fact is the basis of many "textbook" frequentist tests 
for periodicity. However, pc depends the parameters 6 (and, more 
generally, the model H), which in general we do not know. 

The standard solution is to estimate the parameters, e.g. by fit- 
ting the periodogram data, and thereby estimate the spectral density 
Sj under the null hypothesis, call this Sj, and use this estimate in 
the test statistic 



Rf s = 2I° hB /Sj. 



(15) 



The problem is that the distribution of Rj will not be simply x\ 
since that does not account for the uncertainty in the spectral es- 
timate Sj. V05 presented a partial solution to this, by treating the 
statistic Rj as the ratio of two random variables under certain sim- 
plifying assumptions. In what follows we use Bayesian methods to 
develop a much more general method for estimating the parameters 
of a power spectral model, and posterior predictive model checking 
to check the quality of a model fit and to map out the distribution 
of Rj conditional on the observed data. 



6 PERIODOGRAM ANALYSIS VIA THE LIKELIHOOD 
FUNCTION 

As discussed in |V05L an d based on the results of 
IGeweke & Porter-HudakI jl983h , a very simple way to obtain 
a reasonable estimate of the index and normalisation of a power 
law power spectrum, S(f) = /3f~ a , is by linear re gression of 
log/° bs on log/j (see also IPilgram & Kaplarj fl998l) . This pro- 
vides approximately unbiased and normally distributed estimates 
of the power law index (a) and normalisation (actually log (3) even 
for relatively few periodogram points (i.e. short time series). 

The log periodogram regression method has the advantage of 
being extremely simple computationally, so that estimates of the 
power law parameters (and their uncertainties) can be found with 
minimal effort. However, the method does not easily generalise to 
other model forms and does not give the same results as direct max- 
imum likelihood analysi even in the special case of a power law 
model. 

As discussed in lAndersonetall ( ll990h . and also Appendix A 
of |V05L maximum likelihood estimates (MLEs) of the parameters 
of a model S(6) may be found by maximizing the joint likelihood 
function 



N/2 

p(i\e,H) = l[ P (ij\Sj 



(16) 



4 lAnderssonl fe002l) provided a modification of the 
Gewek e & Porter-H udak 11983) fitting method based on the fact that 
the logarithm of the periodogram ordinates follow a Gumbel distribution. 
He gives the log likelihood function for the logarithm of the periodogram 
fitted with a linear function. Maximising this function should give the 
maximum likelihood estimates of the power law parameters. 
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(cf. eqnll3t. or equivalently minimising the following function 



£>(I,0,tf) = -21ogp(I|0,tf) 



N/2 

2 E 

2=1 



+ logS J 



(17) 



which is twice the minus log likelihoorjfl This is somet imes 
known as the W hittle likelihood method, after I Whittle! i 1953h and 
Whittle! Jl957h . and has be en discussed in detail els ewhere (e.g . 
Hannanl fl973r IPawitan & O'SullivarJ 1 19941 : iFan & Zhangl 12004 



Contreras-Cristan et al.ll2006h. Here we u se the notation D(I, 6) 
for consistency with Gelman et alj J20041 section 6.7) where it is 
used as the deviance, a generalisation of the common weighted 
square deviation (or chi square) statistic. Finding the MLEs of the 
parameters is the same as finding 



0mle = arg min D(I° S ,6,H) 

e 

= arg maxp(I = I° bs |#, H) 

e 



(18) 



7 BAYESIAN PERIODOGRAM ANALYSIS THROUGH 
MCMC 

We have now laid the groundwork for a fully Bayesian periodogram 
analysis. Equation QJo] gives the likelihood function for the data 
given the model S(0), or equivalently, equation[T7]gives the minus 
log likelihood function, which is often easier to work with. Once 
we assign a prior distribution on the model parameters we can ob- 
tain their joint posterior distribution using Bayes theorem (eqn[TJ 



p(O\I,H)cxp(I\O,H) P (0\H) = q{6\l,H). 



(19) 



where q(6\I, H) is the unnormalised (joint posterior) density func- 
tion (the normalisation does not depend on 6). This can be sum- 
marised by the posterior mean (or mode) and credible intervals 
(equations[2]and[3}. We may also assess the overall fit using a pos- 
terior predictive p- value for some useful test quantity. 

We may now write an expression for the joint posterior density 
(up to a normalisation term), or its negative logarithm (up to an 
additive constant) 



log 9(011, H) = D(I, 0, H)/2 - logp(0|#) 



(20) 



The posterior mode may then be found by minimising this function 
(e.g. using a good numerical non-linear minimisation algorithm, 
or Monte Carlo methods in the case of complex, multi-parameter 
models). 

In the limit of large sample size (N — > oo) the posterior den- 
sity will tend to a mult ivariate Normal und er quite general condi- 
tions (see chapter 4 of iGelman et alj [2004). For finite N we may 
make a first approximation to the posterior using a multivariate 
Normal distribution centred on the mode and with a covariance 
matrix E equal to the curvature of the log posterior at the mode 

5 The periodogram is \% distributed (equation ll3t for Fourier frequencies 
j = 1,2,..., N/2 - 1. At the Nyquist frequency (j = N/2) it has a x\ 
distribution. One could choose to ignore the Nyquist frequency (sum over 
j = 1, . . . , N/2 — 1 only), or modify the likelihood function to account for 
this. But in the limit of large N the effect on the overall likelihood should 
be negligible, and so we ignore it here and sum over all non-zero Fourier 
frequencies. 

6 For a function f(x), arg min f(x) gives the the set of points of x for 
which f(x) attains its minimum value. 



(see lGelman et al.l2004l section 12.2 and lAlbertl2007l . section 5.5). 
(Approximating the posterior as a Normal in this way is often called 
the Laplace approximation.) This can be used as the basis of a pro- 
posal distribution in a Markov chain Monte Carlo (MCMC) algo- 
rithm that can efficiently generate draws from the posterior dis- 
tribution q(0\I, H), given some data I = I obs . The MCMC was 
generated by a random-walk Metropolis-Hastings algorithm using 
a multivariate Normal (with the covariance matrix as above, but 
centred on the most recent iteration) as the proposal distribution. 
More details on posterior simulation using MCMC is given in Ap- 
pendix [A] For each set of simulated parameters we may generate 
the corresponding spectral model S(6) and use this to generate a 
periodogram F op from the posterior predictive distribution (which 
in turn may be used to generate a time series if needed, see Ap- 
pendix |B). 



8 POSTERIOR PREDICTIVE PERIODOGRAM CHECKS 

With the data simulated from the posterior predictive distribu- 
tion, I rcp , we may calculate the distribution of any test statis- 
tic. Of course, we wish to use statistics that are sensitive to the 
kinds of model deficiency we are interested in detecting, such as 
breaks/bends in the smooth continuum, and narrow peaks due to 
QPOs. Given the arguments of sections[5]a sensible choice of statis- 
tic for investigating QPOs is Tr == maxj Rj (see equation 1151 ). 
Notice that there is no need to perform a multiple-trial (Bonfer- 
roni) correction to account for the fact that many frequencies are 
tested before the strongest candidate is selected, as long as this ex- 
act procedure is also applied to the simulated data as the real data. 

Another useful statistic is based on the traditional \ 2 statistic, 
i.e. the sum of the squared standard errors 
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where E[-] and V[-] indicate expectation and variance, respectively. 
We use Tsse — X 2 (Ij where 6 is the mode of the posterior dis- 
tribution. This is an "omnibus" test of the overall data-model match 
("goodness-of-fit") and will be more sensitive to inadequacies in 
the continuum modelling since all data points are included (not just 
the larg est outlier as in Tr). Th is is the same as the merit function 
used by I Anderson et al. N 199(1 eqn. 16), which we call Tsse (for 
Summed Square Error). 

The above two statistics are useful for assessing different as- 
pects of model fitness. By contrast the L i kelihood Ratio Test (L RT) 
statistic jEadie et alJl97lMCowanlll998l : |Protassov et al.ll2002l) is a 
standard tool for comparing nested models. As such it may be used 
to select a continuum model prior to investigating the residuals for 
possible QPOs. The LRT statistic is equal to twice the logarithm of 
the ratio of the likelihood maxima for the two models, equivalent to 
the difference between the deviance (which is twice the minimum 
log likelihood) of the two models 
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(22) 



Asymptotic theory shows that, given certain regularity conditions 
are met, this statistic should be distributed as a chi square vari- 
able, TLrt ~ Xv, where the number of degrees of freedom v is 
the difference between the number of free parame ters in H\ and 
Ho- When the regularity conditions are not met (see lFreeman et all 
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1 19991 ; |Protassovetai]|2002l ; IPark et alj |20o3) we do not expect 
the distribution to be that of the asymptotic theory. Neverthe- 
less, the LRT is a powerful statistic for comparing models and 
can be calibrated by p oste rior predictive s imula tion, as shown by 
IProtassov etal] fcOOa and lRubin & Sternl dl994l) . 



9 APPLICATION TO AGN DATA 

In this section we apply the method detailed above to two exam- 
ple datasets, both long observations of nearby, variable Seyfert 1 
galaxies, obtained from the XMM-Newton Science Archivqj. 




p=0.001 



9.1 The power spectrum model 

We shall restrict ourselves to two simple models for the high fre- 
quency power spectrum of the Seyferts. The first (Ho) is a power 
law plus a constant (to account for the Poisson noise in the detec- 
tion process) 



s(,n = i3.r a + 1 



(23) 



with three parameters 6 = {a, 0, 7}, where (the power law nor- 
malisation) and 7 (the additive constant) are constrained to be non- 
negative . The second model (H i) is a bending power law as advo- 
cated bv lMcHardv et alj d2004l) 



S(f) = 0f 



+ 7 



(24) 



with four parameters 9 = {a, 0, 7, S}. For this model 0, 7 and 
5 (the bending frequency) are all non-negative. The parameter a 
gives the slope at high frequencies (/ 3> S) in model Hi, and 
the low frequency slope is assumed to be —1. (This assumption 
simplifie s the model fitting process, and seems reasonable given the 
result s of lUttlev et all2002MMarkowitz et al.l2003l ; lMcHardv et al.l 
20041 and lMcHardv et al.ll2006l . but could be relaxed if the model 
checking process indicated a significant model misfit.) In the limit 
of 5 — > the form of Hi tends to t hat of the simple pow er law Ho. 

Following the advice given in lGelman et alj J2004l) we apply 
a logarithmic transformation to the non-negative parameters. The 
motivation for this is that the posterior should be more symmet- 
ric (closer to Normal), and so easier to summarise and handle in 
computations, if expressed in terms of the transformed parameters. 
We assign a uniform (uninformative) prior density[f| to the trans- 
formed parameters, e.g. p(a, log 0, log 7) = const for model Ho. 
This corresponds to a uniform prior density on the slope a and a 
Jeffreys prior on the parameters restricted to be non-negative (e.g. 
p(0) — 1/0), which i s the conventional prior for a scale factor 
ial ll996l : iGelman et alj|2004l ; lGregorvll200l ; lAlbertl 
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7 See http : //xmm .esac.esa. int /| 

8 Although strictly speaking these prior densities are improper, meaning 
they do not integrate to unity, we may easily define the prior density to be 
positive only within some large but reasonable range of parameter values, 
and zero elsewhere, and thereby arrive at a proper prior density. In the limit 
of large N the likelihood will dominate over the uninformative prior and 
hence the exact form of the prior density will become irrelevant to the pos- 
terior inferences. 
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Figure 1. Posterior predictive distribution of the LRT statistic under Ho 
for the RE J1034 + 396 data (computed using 5, 000 datasets simulated 
under Ho)' The observed value T£p5p = 9.67 is shown with the verti- 
cal line, alongside the corresponding p-value. The distribution is not \\ 
as would be predicted by the standard theory but instead resembles a mix- 
ture of distributions with half the probability in a \i distribution and half 
concentrated around zer o. This might be expected given the arg uments of 
iTitterington et alj (l985, section 5.4) and lProtassov et al] l2002t Appendix 
B). 



Table 2. Posterior summaries of parameters for model Hi for the RE 
J1034 + 396 data. The four parameters are as follows: a = power law 
index, (3 = normalisation (in power density units at 1 Hz, i.e. [rms/mean] 2 
Hz -1 ), 7 (Poisson noise level in power density units, [rms/mean] 2 Hz -1 ), 
8 (bend frequency in Hz). The columns give the parameter name, the pos- 
terior mean and the lower and upper bounds of the 90 per cent credible 
intervals. 
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9.2 Application to XMM-Newton data of RE J1034+396 

The first test case we discuss is the interesting XMM-Newton 
observation of t he ult rasoft Seyfert 1 galaxy RE J1034 + 396. 
iGierlinski et"ai1 (2008) analysed these data and reported the detec- 
tion of a significant QPO which, if confirmed in repeat observations 
and by independent analyses, would be the first robust detection 
of its kind. For the present analysis a 0.2 — 10 keV time series 
was extracted from t he archival data using standard methods (e.g. 
Vaug han et ai]|2 003) and binned to 100 s, to match that used by 
Gierlihski et al.l j2008h . 

The two candidate continuum models discussed above, Ho 
and Hi were compared to the data, which gave D^ ni B n (Ho) = 



504.89 and D°f; s n (_ffi) = 495.22, therefore T^ T = 9.67. The 
MCMC was used to draw from the posterior of model Ho, and these 
draws were used to generate posterior predictive periodogram data, 
which were also fitted with the two models and the results used 
to map out the posterior predictive distribution of TLrt, which is 
shown in Fig.[TJ The corresponding tail area probability for the ob- 
served value is p = 0.001, small enough that the observed reduc- 
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Table 3. Posterior summaries of parameters for model Hi for the Mrk 766 
data. The columns are as in TablefJ] 
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Figure 3. RE J1034 + 396 data and model (Hi ) computed at the posterior 
mode. The data are shown as the histogram and the model is shown with 
the smooth curve. T he lower panel shows t he data/model residuals on a 
logarithmic scale. SeelGierliriski et al. 1 2008) for details of the observation. 



tion in D m i n between Ho and Hi is larger than might be expected 
by chance if Hq were true. We therefore favour Hi and use this as 
the continuum model. In the absence of complicating factors (see 
below) this amounts to a significant detection of a power spectral 
break. 

Using Hi as the continuum model we then map out the pos- 
terior distribution of the parameters using another MCMC sample. 
Table [2] presents the posterior means and intervals for the parame- 
ters of model Hi, and figure [2] shows the pairwise marginal poste- 
rior distributions for the parameters of the model. Figure [3] shows 
the data and model evaluated at the posterior mode. 

Clearly there is a large outlier at ~ 2.5 x 10 -3 Hz in the 
residuals after dividing out the model (Hi, computed at the pos- 
terior mode) which may be due to additional power from a QPO. 
We therefore calculate the posterior predictive distributions of the 
two test statistics Tr and Tsse and compared these to the observed 
values (T£ bs = 18.41 and 2||| = 542.3). The posterior predic- 
tive distributions of these two statistics, derived from 5, 000 simu- 
lations, are shown in Fig. [4] Both these statistics give moderately 
low p-values (pr = 0.035 and psse = 0.025), indicating there 
is room for improvement in the model and that the largest outlier 
is indeed rather unusual under Hi . This may indicate the presence 
of power from a QPO or some other deficiency in the continuum 
model. Very similar results were obtained after repeating the pos- 
terior predictive p- value calculations with a variant of Hi in which 
the low frequency index (at / <C 5) is fixed at rather than — 1, 
indicating that the p-values are not very sensitive to this aspect of 
the continuum model. 

lGierlrnskietal.lf200c?) split the time series into two segments 
and focussed their analysis on the second of these, for which the pe- 
riodogram residual was largest and concentrated in one frequency 
bin only. The division of the data into segments is based on a 
partial analysis of the data - it is in effect the application of a 
data-dependent "stopping rule" - and it is extremely difficult to 
see how such a procedure could be included in the generation of 
replicated data F cp used to calibrated the posterior predictive p- 
values. We therefore consider p-values only for the analysis of the 
entire time series and do not try to replicate exactly the analysis of 
iGierlinski et"al] faOOSh . 
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Figure 5. Mrk 766 data and model (Hi ) computed at the posterior mode. 
The panels are the same as in FigurefJ] 



9.3 Application to XMM-Newton data of Mrk 766 

A similar analysis was performed on the XMM-Newton observa - 
tion of Mrk 766 discussed previously bv lVaughan & Fabianl d2003l) . 
who claimed to have detected a power spectral break using frequen- 
tist (classical) statistical tools such as \ 2 fitting. The LRT statistic 
for the data was T£rt = 18.56, and the posterior predictive dis- 
tribution for this statistic had the same shape as in the case of RE 
J1034 + 396 (FigureQ}. The p-value for the LRT comparison be- 
tween Ho and Hi was p < 2 x 10~ 4 (i.e. not one of the 5, 000 
simulations gave a larger value of Tlrt)- This amounts to a very 
strong preference for Hi over Ho, i.e. a solid detection of a spectral 
break. 

Table [3] summarises the posterior inferences for the parame- 
ters of Hi and Figure [5] shows the data, model and residuals. The 
residuals show no extreme outliers, and indeed the observed values 
of the test statistics Tr, and Tsse were not outliers in their poste- 
rior predictive distributions (pr = 0.93 and psse = 0.89). These 
suggest that Hi provides an adequate description of the data (i.e. 
without any additional components). 



9.4 Sensitivity to choice of priors 

It is important to check the sensitivity of the conclusions to the 
choice of the prior densities, by studying, for example, the effect of 
a different or modified choice of prior on the posterior inferences. 
We have therefore repeated the analysis of the RE J1034 + 396 data 
using a different choice of priors. In particular, we used indepen- 
dent Normal densities on the four transformed parameters of Hi, 
this is equivalent to a Normal density on the index a and log nor- 
mal densities on the non-negative valued parameters (5, 7 and S. In 
other words, for each of the transformed parameters p(9i\Hi) = 
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Figure 2. Pairwise marginal posterior distributions for the parameters of H±: a = power law index, j3 = normalisation (in power density units at 1 Hz, i.e. 
[rms/mean] 2 Hz -1 ), 7 (Poisson noise level in power density units, [rms/mean] 2 Hz -1 ), 8 (bend frequency in Hz). The parameters /3 and 8 are shown on 
a logarithmic scale. The lower- left panels show the contours evaluated using all 75, 000 posterior simulations, and the upper-left panels show some of the 
simulated posterior data (for clarity only 1, 000 points are shown). 
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Figure 4. Posterior predictive distributions of the Tr and TggE statistics under Hi for the RE J1034 + 396 data. The observed value of each is shown with a 
vertical line. 



N(ni,ai) where the hyper-parameters \n and <n control the mean 
and width of the prior density functions. After choosing values 
for the hyperparameters based on knowledge gai ned from previous 
studies of nearby, luminous Seyfert galaxies (e.g.[U ttley et al. 2002; 
iMarkowitz et alj|2003l ; |Papadakidl2004l ; iMcHardv et alj|2006t) . as 
outlined below, the posterior summaries (parameter means and in- 
tervals, pairwise marginal posterior contours, and posterior predic- 
tive p- values) were essentially unchanged, indicating that the infer- 
ences are relatively stable to the choice of prior. 

Previous studies usually gave a high frequency index parame- 
ter in the range a ~ 1 — 3, and so we assigned p(a \ Hi) = iV(2,4), 
i.e. a prior centred on the typical index of 2 but with a large disper- 
sion (standard deviation of 2). The normalisation of the / _1 part 



of the power spectrum is thought to be similar between different 
sources, with p ~ 0.005 - 0.03 (see |Papadakisi r2004). we as- 
signed p(log P\Hi) — N(~2, 1), i.e. a decade dispersion around 
the mean of p ~ 10~ 2 . The Poisson noise level is dependent on 
the count rate, which can be predicted very crudely based on previ- 
ous X-ray observations; we assign a prior p(log 7I-H1) = N(0, 1). 
The bend/break frequency 5 is thought to correlated with other sys- 
tem parameters such as A/bh, bolometric luminosity Lb i and op- 
tical line width (e.g. FWHM H/3). Using the estimated luminos- 
ity, and assu ming RE J1034 + 396 is radiating close to the Ed- 
dington limit (Middl eton et al.l2009l) gave a prediction for the bend 
ti mescale of Tb ~ 1 . 6 x 1 -3 s, and using the optical line width 
of lVeron-Cettv et alj J200lh gave T b ~ 1.2 x 10~ 3 s, using the re- 
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lations of McHar dv et al.l j2006). Both these (independent) predic- 
tions suggest S = 1/Tb ~ 10 -3 Hz, and we therefore assigned a 
prior density p(logS\Hi) = N( — 3, 1). All of these priors are rea- 
sonably non-informative - they have quite large dispersion around 
the mean values, to account for the fact that the empirical rela- 
tions used make these predictions are rather uncertain themselves 
and also contain intrinsic scatter (i.e. there are significant source to 
source differences) - yet they do include salient information about 
the model obtained from other sources. 



10 DISCUSSION 

We have described, in sections 1 6181 a Bayesian analysis of peri- 
odogram data that can be used to estimate the parameters of a 
power spectral model of a stochastic process, compare two com- 
peting continuum models, and test for the presence of a narrow 
QPO (or strict periodicity). 

10.1 Limitations of the method 

The Whittle likelihood function (equation 1 16b is only an approxi- 
mation to the true sampling distribution of a periodogram. In the 
absence of distortions due to the sampling window (more on this 
below), the ordinates of the periodogram of all stationary, linear 
(and many non-linear) stochastic processes become independently 
distributed following equation [T3] as N —* 00. With finite N (i.e. 
for real data) this is only approximately true, although with reason- 
able sample sizes (e.g. N > 100) it is a very good approximation. 

More serious worries about the distribution of the peri- 
odogram, and hence the validity of the Whittle likelihood, come 
from distorti ons due to the sam pling effects known as aliasing and 
leakage (e.g. lUttlev et alj2002h . It is fairly well established that X- 
ray light curves from Seyfert galaxies are stationary once allowance 
has been made fo r their red noise character and the line ar "rms- 
flux" relation (see lVaughan et alj|2003l : luttlev et ai]|2005l) . Distor- 
tions in the expectation of the periodogram can be modelled by 
simulating many time series for a given power spectral model, re- 
sampling these in time as for the origi nal data, and then calculat- 
ing the average of their periodograms JUttlev et alj[2002l and Ap- 
pendix[B). This does not account for distortions in the distribution 
of the periodogram ordinates (away from equation[l3]predicted by 
asymptotic theory), which is a more challenging problem with (as 
yet) no accepted solution. However, these affects will be minor or 
negligible for the data analysed in section [9] which are contigu- 
ously binned, as the effect of aliasing will be lost in the Poisson 
noise spectrum which do minates at high frequencies ( Ivan der Klisl 
ll989l ; IUttlevetal.ll2002h . and the leakage of power from lower to 
higher frequencies is very low in cases where the power spectrum 
index is a <J 1 at the lowest observed frequencies. The task of 
fully accounting for sampling distortions in both the expectation 
and distribution of the periodogram, and hence having a more gen- 
eral likelihood function, is left for future work. 

We should also point out that the usual limitations on the use 
and interpretation of the periodogram apply. These include the (ap- 
proximate) validity of the Whittle likelihood only when the time 
series data are evenly sampled. It may be possible to adjust the like- 
lihood function to account for the non-independence of ordinates in 
the modifie d periodogram usually used with unevenly sampled time 
series (e.g. lScargldfl982l) , but here we consider only evenly sam- 
pled data. It is also the case that the periodogram, based on a de- 
composition of the time series into sinusoidal components, is most 



sensitive to sinusiodal oscillations, especially when they lie close to 
a Fourier fr equency (i.e. the t ime series spans an integer number of 
cycles; see Ivan der Klij[T989h . In situations where the time series 
is large and spans many cycles of any possible periods (the large N 
regime), there is no reason to go beyond the standard tools of time 
series processing such as the (time and/or frequ ency) binned peri - 
odogram with approximately normal error bars Ivan der Klisll989l) . 
The current method uses the raw periodogram of a single time se- 
ries (with the Whittle likelihood) in order to preserve the frequency 
resolution and bandpass of the data, which is more important in the 
low N regime (e.g. when only a few cycles of a suspected period 
are observed). 

The time series data analysed in section [9] were binned up to 
100 s prior to computing the periodogram; this in effect ignores 
frequencies above 5 x 10~ 3 Hz which are sampled by the raw data 
from the detectors (recorded in counts per CCD frame at a much 
higher rate). The choice of bin size does affect the sensitivity to pe- 
riodic signals of the method described in sections 16181 Obviously 
one looses sensitivity to periodic components at frequencies higher 
than the Nyquist frequency. But also as more frequencies are in- 
cluded in the analysis there are more chances to find high Tr values 
from each simulation, which means the posterior predictive distri- 
bution of the test statistic does depend on the choice of binning. 

One could mitigate against this by imposing a priori restric- 
tions on the frequencies of any allowed periods, for example by al- 
tering the test statistic to be Tr, = maxj<j Rj where Jo is some 
upper limit. (The lower frequency of the periodogram is restricted 
by the duration of the time series, which is often dictated by obser- 
vational constraints.) But these must be specified independently of 
the data, otherwise this is in effect another data-dependent stopping 
rule (the effect of limiting the frequency range of the search is il- 
lustrated below in the case of the RE 11034 + 396). This sensitivity 
to choice of binning could be handled more effectively by consid- 
ering the full frequency range of the periodogram (i.e. no rebinning 
of the raw data) and explicitly modelling the periodic component of 
the spectrum with an appropriate prior on the frequency range (or 
an equivalent modelling procedure in the time domain). But this 
suffers from the practical drawbacks discussed below. 

10.2 Alternative approaches to model selection 

In many settings the Likelihood Ratio Test (LRT, or the closely 
related F-test) is used to choose between two competing models: 
the observed value of the LRT statistic is compared to its theoretical 
sampling (or reference) distribution, and this is usually summarised 
with a tail area probability, or p- value. As discussed above this pro- 
cedure is not valid unless certain specific conditions are satisfied by 
the data and models. In the case of comparing a single power law 
(Ho of section [9} to a bending power law (Hi) the simpler model 
is reproduced by setting the extra parameter S — > in the more 
complex model, which violates one of the conditions required by 
the LRT (namely that null values of the extra parameters should 
not lie at the boundaries of the parameter space). In order to use 
the LRT we must find the distribution of the statistic appropriate 
for the given data and models, which can be done using posterior 
predictive simulations. This method has the benefit of naturally ac- 
counting for nuisance parameters by giving the expectation of the 
classical p-value over the posterior distribution of the (unknown) 
nuisance parameters. 

One could in principle use the posterior predictive checks to 
compare a continuum only model (e.g. Ho or Hi) to a continuum 
plus line (QPO) model (H2) and thereby test for the presence of 



10 S. Vaughan 



an additional OPO. IProtassov et al.l J2002h and IPark et all J2008h 
tackled just this problem in the context of X-ray energy spectra 
with few counts. However, we deliberately do not define and use 
a model with an additional line for the following reasons. Firstly, 
this would require a specific line model and a prior density on 
the line parameters, and it is hard to imagine these being gener- 
ally accepted. Unless the line signal is very strong the resulting 
posterior inferences may be more sensitive to the (difficult) choice 
of priors than w e would generally wish. Secondly, as shown by 
IPark et all d2008l) . there are considerable computational difficulties 
when using models with additional, narrow features and data with 
high variance (as periodograms invariably do), due to the highly 
multi-modal structure of the likelihood function. Our pragmatic al- 
ternative is to leave the continuum plus line model unspecified, but 
instead choose a test statistic that is particularly sensitive to narrow 
exce sses in power such as might be produced under such a model 
(see iGelman et af]| 19961 and associated discussions, for more on 
the choice of test statistic in identifying model deficiency). This 
has the advantages of not requiring us to specify priors on the line 
parameters and simplifying the computations, but means the test 
is only sensitive to specific types of additional features that have a 
large effect on the chosen test statistic. (It is also worth pointing out 
that the periodogram ordinates are randomly distributed about the 
spectrum of the stocastic process Sr(/). If a deterministic process 
is also present, e.g. producing a strictly periodic component to the 
signal, this will not in general follow the same \ 2 distribution and 
the Whittle likelihood function would need to be modified in order 
to explicitly model such processes in the spectral domain.) 

One of the most popular Bayesian met hods for choosing be- 
tween competing models is the Bayes factor |Kass & Raftervll9"9^ : 
ICarlin & Louisll2000l ; IGelman et ai]|2004 lLeell2004h . These pro- 
vide a direct comparison of the weight of evidence in favour of 
one model compared to its competitor, in terms of the ratios of 
the marginal likelihoods for the two models (equation [7}. This 
may be more philosophically attractive than the posterior predic- 
tive model checking approach but in practice suffers from the same 
problems outlined above, namely the computational challenge of 
handling a multi-modal likelihood, and the sensitivity to priors on 
the line parameters, which may be even greater for Bayes fac- 
tors than other m ethods (see arguments in IProtassov et al. 2002; 
IGelman et al]|2004) . 



10.3 Comparison with lVOSl 

|V05| tackled the same problem - the assessment of red noise spec- 
tra and detection of additional periodic components from short time 
series - using frequentist methods. The method developed in the 
present paper is superior in a number of ways. The new method is 
more general in the sense that the model for the continuum power 
spectrum (i.e. the "null hypothesis" model that contains no peri- 
odicities) may in principle take any parametric form but was previ- 
ously restricted to a power law. It also provides a natural framework 
for assessing the validity of the continuum model, which should be 
a crucial step in assessing the evidence for additional spectral fea- 
tur es (see below). Also, by us i ng the Whittle likelihood rather than 
the iGeweke & Porter-Hudald dl983l) fit function, the new method 
actu ally gives smaller mean square errors on the model parameters 
(see|Xndersson 20"02h. 



10.4 Comparison with other time series methods 

Previous work on Bayesian methods fo r period detecti on (e.g. 
iBrett horst 1988; iGregorv & Loredo|[l992l ; lGregorvH l999) has fo- 
cussed on cases where the stochastic process is assumed to be white 
(uncorrelated) noise on which a strictly periodic signal is super- 
posed. They do not explicitly tackle the more general situation of 
a non-white continuum spectrum that is crucial to analysing data 
from compact accreting X-ray sources. 

The only non-Bayesian (i.e. frequentist) methods we are aware 
of for assessing evidence for periodicities in data with a non-white 
spectrum involve applying some kind of smoothing to the raw pe- 
riodogram data. This gives a non-parametric estimate of the un- 
derlying spectrum, with some associated uncertainty on the esti- 
mate, which can then be compared to the unsmoothed periodogram 
data and used to search fo r outlying periodo gram points. The Multi- 
Taper Method (MTM) of iThomsorj dl982h (also lThompsor]|l990l) 
achieves the smoothing by averaging the multiple periodograms, 
each co mputed using one member of a set of orthogonal data ta- 
pers. See lPercival & Waldenl d 19931 chapter 7) for a good discus- 
sion of this method. The data tapers are designed to reduce spec- 
tral leakage and so reduced bi as in the resulting spe ctrum esti- 
mate. The method proposed by llsrael & Stellal 1 1996) involves a 
more straightforward running mean of the peridogram data. Both 
of these are non-parametric methods, meaning that they do not in- 
volve a specific parametric model for the underlying spectrum. This 
lack of model dependence might appear to be an advantage, but in 
fact may be a disadvantage in cases where we do have good rea- 
sons for invoking a particular type of parametric model (e.g. the 
bending power laws seen in the Seyfert galaxy data). The contin- 
uum model's few parameters may be well constrained by the data, 
where the non-parametric (smoothed) estimate at each frequency is 
not. The non-parametric methods also leave a somewhat arbitrary 
choice of how to perform the smoothing, i.e. the type and number of 
data t apers in the MTM, or the size/shape of the smoothing kernel 
in the llsrael & Stellal {1996) method. Also, it is less obvious how 
to combine the sampling distribution of the periodogram ordinate 
(line component) and the spectrum estimate (continuum), and how 
to account for the number of "independent" frequencies searched. 
These are all automatically included in the posterior predictive p- 
value method as outlined above. 

In the present paper we have deliberately concentrated on the 
periodogram since this is the standard tool for time series analysis 
in astronomy. But the periodogram is by no means the best or only 
tool for the characterisation of stochastic processes or the identi- 
fication of periodicities. Methods that explicitly model the origi- 
nal time series data in the time domain (see e.g. IPriestlevI 1 1 98 ll : 
IChatfieldi r2003) may yet prove to be valuable additions to the as- 
tronomers toolkit. Indeed the raw form of the XMM-Newton data 
used in the AGN examples is counts per CCD frame, for the source 
(and possibly background region if this is a non-negligible contri- 
bution). The most direct data analysis would therefore model this 
process explicitly as a Poisson process with a rate parameter that 
varies with time (i.e. the "true" X-ray flux) that is itself a realisa- 
tion of some stochastic process with specific properties (e.g. power 
spectrum or, equivalently, autocorrelation function, and stationary 
distribution). 

10.5 The importance of model assessment 

The posterior predictive approach provides an attractive scheme for 
model checking. In particular, it allows us to select a continuum 
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Figure 6. Simulated time series generated from the posterior predictive 
distribution of the RE J1034 + 396 periodogram data. The (grey) his- 
togram shows the simulated data in 100 s bins and the smooth (red) curve 
shows the 6 bin moving average of these data. Compare with Figure 1 of 
Gierl inski et aD (2008). The power spectrum used to generate these data is 
a smoothly bending power law (plus white "measurement" noise) with no 
periodic or quasi-periodic components, and they the time series appears to 
show oscillatory structure. 



model that is consistent with the observed datjfl before testing for 
the presence of additional features. This is crucial since any sim- 
ple test statistic, whether used in a frequentist significance test or 
a posterior predictive test, will be sensitive to certain kinds of de- 
ficiencies in the model without itself providing any additional in- 
formation about the specific nature of any deficiency detected (a 
p- values is after all just a single number summary). A low p- value 
(i.e. a "significant" result) may be due to the presence of interest- 
ing additional features or just an overall poor match between the 
data and the conti nuum model (for more o n this in the context of 
QPO detection see IVaughan & Uttlevll2006l) . The use of more than 
one test statistic, properly calibrated using the posterior predictive 
simulations, as well as other model diagnostics (such as data/model 
residual plots) are useful in identifying the cause of the data/model 
mismatch. 



10.6 Analysis of two Seyfert galaxies 

Section[9]presents an analysis of XMM-Newton data for the Seyfert 
galaxies RE J1034 + 396 and Mrk 766. The former has pro- 
duced the best eviden ce to date for a QPO in a Seyfert galaxy 
dGierlinski etal] 120081) , while the latter showed no indication 
of Q P O behaviour dVaughan & Fabianl 120031 : IVaughan & Uttlevl 
12003) . IGierlinski et al.1 j2008l) used the method presented in |V05I 



to show that the observed peak in the periodogram was highly un- 
likely under the assumption than the underlying power spectrum 
continuum is a power law, but the present analysis gave somewhat 
less impressive evidence to suggest a QPO. 

The posterior predictive p ~ 0.03 comes from the fact that 
~ 150 out of the 5, 000 posterior predictive simulations of the RE 



9 Strictly, we compare the observed data to simulations drawn from the 
posterior predictive distribution under the chosen model H using test statis- 
tics. If the observed data do not stand out from the simulations, by having 
extreme values of the statistics when compared to the simulations, we may 
assume that the data are consistent with the model (as far as the particular 
test statistics are concerned). 



J1034 + 396 periodogram data showed Tr > Tr (and approx- 
imately the same figure was obtained using Tsse)- This might at 
first seem doubtful giv en how periodic the ob served time series ap- 
pears (see Figure 1 of IGierlinski et alj|2008l) . But to demonstrate 
that such apparently periodic time series may indeed be gener- 
ated from non-periodic processes we simulated time series from the 
posterior predictive periodogram data (for model Hi) that showed 
Tr > T£ bs . (The time series simulation method is given in Ap- 
pendix [B]) One example of these time series, chosen at random 
from the subset that had the largest residual Rj occurring at a fre- 
quency of the same order as that seen in RE J1034 + 396 (in this 
case ~ 1.3 x 10 -4 Hz), is shown in Figure[6] 

There are several reasons for the very different p-values be- 
tween the analyses. One of these factors is that we based our cal- 
culation on a more general form of the continuum model. In the 
absence of a QPO (spectral line component) the power spectrum 
continuum is well modelled using a power law with a steep slope 
(a ~ 3) that smoothly changes to a flatter slope (assumed index of 
— 1) below a frequency 5 ~ 4 x 10~ 4 Hz, than a single power law. 
The bend frequency is close to that of the candidate QPO, which 
does have a large effect on the "significance" of the QPO as sum- 
marised in the p-value (see IVaughan~& Uttlev 2005], for previous 
examples of this effect). Indeed, the posterior predictive p-value 
was 2 x 10~ 3 when recalculated assumi ng a simple power law 
continuum (Ho). A second factor is that Gier linski et al.l (2008) 
gave special consideration to a particular subset of the times se- 
ries chosen because of its apparently coherent oscillations, which 
in effect enhanced the apparent significance of the claimed pe- 
riodicity, while the entire time series is treated uniformly in the 
present analysis (for reasons discussed in section [9j, A third fac- 
tor is that we made no restriction on the allowed frequency of a 
period component, and so openly searched 457 frequencies, where 
IGierlinski et al.l d2008l) concentrated on the ~ 60 frequencies in 
their periodogram below 10~ 3 Hz. This will result in a factor ~ 8 
change in the p-value (since the probability of finding a Tr value 
in a simulation that is larger than the observed in the real data 
scales approximately linearly with the number of frequencies ex- 
amined). If we take Tr, to be the largest residual at frequencies 
below 10 -3 Hz (but including all the data in the rest of the mod- 
elling process), we find 21/5000 of the RE J1034+396 simulations 
showed Tr ^ lR bs under these restricted conditions, correspond- 
ing to p — 0.004, which is smaller by about the expected factor. 
A relatively minor difference is the more complete treatment of 
parameter uncertainties using the posterior distribution (which is 
treated in an approximation fashion in the method of lV05l) . One is 
therefore left with a choice between two models that could plausi- 
bly explain the data, a power law spectrum with a strong QPO or 
a bending power law spectrum (with weaker evidence for a QPO). 
The most powerful and least ambiguous confirmation of the reality 
of the QPO feature would come from a independent observation 
capable of both constraining the continuum more precisely and al- 
lowing a sensitive search for the candidate QPO. 

The results of the present analysis of the Mrk 766 data agree 
reason ably well with those previously reported by IVaughan et al.l 
(2003) which were obtained using standard frequentist methods 
(e.g. binning the data and estimating parameters by minimising the 
X 2 statistic). The high frequency slopes are essentially the same, 
but the frequency of the bend differs by a factor of ~ 2.5. This is 
most likely due to the slightly different models used, i.e. bending 
vs . sharply broken powe r laws. (Repeating the frequentist analysis 
of IVaughan etail d2003l) using the bending power law model gave 
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a lower characteristic frequency, more consistent with that of the 
present analysis). 



10.7 Other applications of this method 

The techniques discussed in this paper may find application well 
beyond the specific field for which they were devised (namely, 
analysis of X-ray light curves from Seyfert galaxies), since the 
problems of estimating a noisy continuum spectrum and assess- 
ing the evidence for additional narrow features over and above 
that continuum are common to many fields. Other examples from 
X-ray astronomy include analysis of long timescale light curves 
from Galactic X-ray binaries and Ultra-Luminous X-ray sources 
(ULXs) in order to characterise the low frequency power spectrum 
and search for periodicities (e.g. due to orbital modulation). 

But the applications are by no means restricted to astronomy. 
For example, in geology there is considerable interest in detect- 
ing and characterising periodicities in stratigraphic records of en- 
vironmental change, which may be connected to periodicities in 
external forci ng such as mig ht be expected from Milankovich cy- 
cles (see e.g. Weedon 2003). However, there is controversy over 
the statistical and physical significance of the periodicities in these 
data, which are often dominated by stochastic red noise variations 
( lBailevll2009h . 



11 CONCLUSIONS 

We have presented Bayesian methods for the modelling of peri- 
odogram data that can be used for both parameter estimation and 
model checking, and may be used to test for narrow spectral fea- 
tures embedded in noisy data. The model assessment is performed 
using simulations of posterior predictive data to calibrate (sensibly 
chosen) test statistics. This does however leave some arbitrariness 
in the method, particularly in the choice of test statistic^ (and in 
some situations the choice of what constitutes a simulation of the 
data). Such issues were always present, if usually ignored, in the 
standard frequentist tests. The posterior predictive approach has 
the significant advantage of properly treating nuisance parameters, 
and provides a clear framework for checking the different aspects 
of the reasonableness of a model fit. The issue of choosing a test 
statistic does not arise in more "purist" Bayesian methods such as 
Bayes factors, which concentrate on the posterior distributions and 
marginal likelihoods, but such methods of model selection carry 
their own burden in terms of the computational complexity and the 
difficulty of selecting (and the sensitivity of inferences to) priors on 
the model parameters. The method presented in this paper, making 
use of the posterior predictive checking, is an improvement over the 
currently popular methods that use classical p- value; but Bayesian 
model selection is an area of active research and it is not unreason- 
able to expect that new, powerful and practical computational tools 
will be developed or adapted to help bridge the gap between the 
pragmatic and the purist Bayesian approaches. 

The routines used to perform the analysis of the real data pre- 



In situations where two competing models can be modelled explicitly 
the LRT provides a natural choice of statistic. 



sented in section[9]will be made available as an script from the 
author upon request. 
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APPENDIX A: SIMULATING FROM THE POSTERIOR 

Here we briefly discuss a method for simulating data from the pos- 
terior density, which is useful for two main reasons. For simple 
models with few parameters it may be possible to make inferences 
from the posterior without the need for Monte Carlo simulations, 
e.g. by directly evaluating the posterior density on a fine grid of 
parameter values. However, even in this case simulations from the 
posterior are needed in order to form the posterior predictive distri- 
bution, and hence the distribution of a test statistic and its posterior 
predictive p- value. For more complicated models or a greater num- 
ber of parameters Monte Carlo methods may be necessary simply 
in order to calculate summaries of the posterior (such as means and 
intervals). 

Markov chain Monte Carlo (MCMC) methods provide a pow- 
erful and popular method for drawing random values from the pos- 
terior density. General introductions to MC MC computations fo r 
Bayesian poste r ior calculation s are given bv lGelman et alj (2004): 
iGregorvl ( [2q05T):lAlbertl J20q7h, and more thorough tre a tments may 
be fou n d irirTierneyl (ll994 h: IChib & Greenberd jl995h : lGilks et all 
dl995h : [Gamermani 1 1 99% . 

The output of an MCMC calculation is a series of parame- 
ter values (or vectors) G f for t — 0, . . . , L (where L is the num- 
ber of simulations performed, i.e. the length of the chain). The 
Metropolis-Hastings MCMC algorithm generates a sequence of 
random draws as follows: 



• Draw a starting point in parameter space 6° for which 

P (e\i,H)>o. 
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• Repeat for t = 1, . . . , L: 

(i) Draw a proposed new parameter point 9* from a proposal 
distribution g(0*\0 t ~ 1 ) that is conditional only on the previous 
point t_1 . 

(ii) Evaluate the ratio 



p(e*|i,g) fl (fl'|fl t - 1 ) 

p(e*- i ii,.ff)fl(0 t " 1 in 

(iii) Set the new value of f 



f 



0* with probability min(r, 1) 



0' 



otherwise 



(Al) 



(A2) 



In order to use this algorithm we need to have defined a pro- 
posal density function g(0*\0 t ^ 1 ), from which we can compute 
densities and draw random values, and be able to evaluate the pos- 
terior density at any valid point in parameter space. Notice that only 
the ratio of posterior densities need be calculated (to give r), mean- 
ing that we can use the unnormalised posterior density q(0\I, H) 
in the computation. The remarkable property of the MCMC algo- 
rithm is that the distribution of the output 0* converges on the target 
distribution p(0\l,H) for any form of proposal distribution (see 
iTiernevI fl99llGilks etZHl995l for regularity conditions). 

The choice of proposal density does however affect the speed 
of convergence to the target distribution (i.e. the efficiency of the 
MCMC calculation) - the algorithm will be most efficient when 
the choice of proposal density closely matches the posterior den- 
sity. We use as the proposal density a Normal random walk, specif- 
ically a multivariate Normal distribution centred on t_1 with the 
covariance £ from the Normal approximation to the posterior (see 
section[7]l. This is a popular and well understood choice of proposal 
and has been discussed extensively in the MCMC literature. In fact, 
it is usually better to use g(0*\0 t ^ 1 ) = iV(0* _1 ,cE) where c is a 
constant scale factor (> 1) tun ed to improve the efficiency of the 
calculation (see section 1 1.9 of lGelman et alj|2004l ). In the present 
analysis we found c = 1.2 to work well. As the normal distribution 
is symmetric, i.e. g^O 1 ^ 3 ) = g(9 :> \9 1 ), the ratio r simplifies to 
the ratio to the posterior densities r = q(6*\I, H)/q(0 t ~ 1 \I, H). 
(We also found that a multivariate Student's ^distribution worked 
comparably well, with a covariance matrix Hv/(y — 2), and v = 3 
degrees of freedom.) 

One must take some care to ensure the output of the MCMC 
has reached its stationary distribution and is e fficiently generat- 



ing dr a ws from the compl e te posterior densit y. [Gelman & Rubin 



1992), iGilks et alj Jl995t) . iKass et alj J 19981) and iGelman et al 



2004) offer advice for checking the quality of the output. We cal- 



culate J separate chains, each starting from a different initial po- 
sitions 0° spread over the parameter space, and check for conver- 
gence before merging the results. In order to remove any depen- 
dence on the initial position we retain only the second half of each 
chain. We the n compute the R statistic l IGelman & Rubin! 1 19921 : 
Gelm an et al. 2004O which compares the variance within each 
chain to the variance between the chains. With a sufficiently large 
number of iterations R will be close to unity, which indicates that 
the chains have reached the desired stationary distribution. 

In practice we discarded the first half of each chain (the "burn 
in" phase) to remove any dependence on the starting position, and 



12 In IGelman & Rubii] < 19921) and the first edition of the book 
IGelman et alj fe004l) this statistic was called B 1 / 2 . 



required R < 1.1 from the remaining half of the chain before as- 
suming the stationary distribution has been reached. We also in- 
spect, for each chain, the acceptance rate, and the time series and 
histograms of the parameters, which may also reveal problems with 
convergence or efficiency of the chains. Once convergence has been 
reached we combine the remaining L/2 points from each chain to 
yield JL/2 sets of parameter values sampled from the posterior 
distribution. 

Given a sufficient number of simulations, f , we can estimate 
the posterior distribution of any quantity of interest, such as the 
untransformed parameters, and from these estimate the posterior 
means (or modes, or medians) and credible intervals by calculating 
the sample means and quantiles. If necessary we can also simu- 
late data I rop from the posterior predictive distribution by sampling 
parameters from the MCMC output, rcp , and for each point cal- 
culating S(0 Tcp ) and then randomly drawing periodogram points 
according to eqn[T3](i.e. a scaled xi distribution). We can then cal- 
culate the posterior distribution of any test statistic T using these 
simulated data, and hence calculate a p- value. 

For the analysis of section [9] we performed an initial fit to the 
data, using a non-linear optimisation algorithm to find the posterior 
mode and covariance matrix, and used this to form the proposal dis- 
tribution for the MCMC. Five chains of length 30, 000 were gen- 
erated from different locations randomly dispersed around the pos- 
terior mode. The first half of each chain (the "burn-in phase") was 
discarded and, after checking convergence was achieved (as mea- 
sured using the diagnostics discussed above), the remaining 75, 000 
values were merged into a single sample. 



APPENDIX B: SIMULATION OF TIME SERIES 



The calculation of a sample from the posterior, 



IF 



,ff),re- 



quires only the output from an MCMC such as outlined above. Pos- 
terior predictive checks require the generation of simulated (repli- 
cated) periodogram data I rcp . The simplest approach is to draw a 
random parameter vector from the posterior, , generate the cor- 
responding power spectral density function Sj(9 t ) at frequencies 
j — 1, . . . , N/2 — 1 and multiply each of these by a random draw 
from the xl (exponential) distribution (eguationll3t 



(Bl) 



(where Xj are independent random variables drawn from the xl 
distribution). The periodogram at the Nyquist frequency j = N/2 
should be multiplied by a random draw from a \l distribution in- 
stead. (The zero frequency component can be safely given zero 
power.) 

Time series may be generated by inverse Fourier transforming 
the randomised periodogram into the time domain (with time steps 
AT), with appropriate phase randomisation. However, the resulting 
Fourier transformed data are strictly periodic with a period N, and 
so there is a wrap-around effect where the start and end of the time 
series are forced to converge. Also, this procedure does not include 
any effects due to transfer of power from frequencies just below 
or above the observed range. More realistic data may be generated 
from a posterior draw f by calculating a power spectrum over a 
wider range of frequencies than are included in the data, e.g. over 
a frequency grid fk = k/VNAT with k = 1, . . . ,K, where V ^ 
1 and W ^ 1 are the factors by which the lowest and highest 
frequencies are extended (respectively), and K = VWN/2. The 



A Bayesian test for periodic signals in red noise 15 



p ower spectral densities S k(Q t ) may then be used in the algorithm 
of iDavies & Hartd G987{3 to produce a time series. 

An alternative, but mathematically equivalent method, is as 
follows: 

• Generate K — 1 random periodogram ordinates 7^ cp by mul- 
tiplying the spectrum Sk(d t ) with random xl variables as in equa- 
tion |BT] (At the Nyquist frequency, k — K, use a Xi variable in- 
stead of X k /2.) 

• Generate K — 1 independent, random phases 4>k over the range 
[— 7r/2, 7r/2) from a uniform distribution. (At the zero and Nyquist 
frequency use cj>k = 0.) 

• Produce a complex vector Fk = Ak exp(— i(j>k) with argu- 
ments Ak — and phases <f>k ■ 

• Extend the vector (of Fourier amplitudes and phases) to neg- 
ative frequencies, setting the Fourier components for the negative 
frequencies F-k = where the asterisk denotes complex con- 
jugation. (Note that the Fourier components are real valued at the 
zero and Nyquist frequencies.) 

• Inverse Fourier transform the {Fk} from the frequency do- 
main to the time domain. 

The resulting series will be 2K(— VWN) points in length, 
with a sampling rate of AT/W and a duration of VNAT, and will 
have a mean of approximately zero. (Time series that more closely 
resemble those of accreting com pact objects can be obtained using 
the exponential transformation o fluttlev et alj20o3 .) One may then 
resample a segment of this to match the sampling pattern of the ob- 
servation, to give a time series of TV points, as required (and with no 
wrap-around effect). For most processes it should make little dif- 
ference whether the noise due to "measurement error" is included 
in the power spectrum, or excluded from the power spectrum and 
added at the resampling stage (e.g. by drawing the counts per bin 
from the Normal or Poi sson distribution af ter appropriate normali- 
sation of the series). See lUttlev et alj {2002) for more on time series 
simulation. 



This paper has been typeset from a TpX/ KTpX file prepared by the 
author. 



3 An equivalent algorithm for gene rating time series from a power spec- 
trum was introduced to astronomy by Timmer & Konig ( 1995). 



